Module 2
University of South Florida
“Without big data, you are blind and deaf and in the middle of a freeway.” — Geoffrey Moore
“In God we trust, all others bring data.” — W. Edwards Deming
What is big data?
A rough definition of big data, by volume:
If the data fits in memory: small data
If the data is bigger than memory but fits on a hard disk: medium data
If the data is bigger than a typical hard disk: big data
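The rule of thumb above can be written as a small function. This is only a sketch: the function name, the 16 GB / 1 TB machine, and the dataset sizes are all illustrative assumptions, not figures from the course.

```python
def classify_by_volume(data_bytes: int, memory_bytes: int, disk_bytes: int) -> str:
    """Label a dataset using the rough volume-based definition."""
    if data_bytes <= memory_bytes:
        return "small data"
    if data_bytes <= disk_bytes:
        return "medium data"
    return "big data"


GB = 1024 ** 3

# Hypothetical machine: 16 GB of memory and a 1 TB disk.
memory, disk = 16 * GB, 1024 * GB

# Hypothetical dataset sizes of 2 GB, 200 GB, and 5,000 GB.
labels = [classify_by_volume(size * GB, memory, disk) for size in (2, 200, 5000)]
print(labels)  # ['small data', 'medium data', 'big data']
```

The labels depend on the machine: the same 200 GB dataset would count as small data on a server with 512 GB of memory.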
(Massively) Large datasets are stored in different ways.
Database
Data Warehouse
Data Lake
A well-organized file cabinet
Structured
Usually managed by a single provider
Many databases accept SQL queries
e.g., Mergent, MSRB, IvyDB
A data pool, or a dumping site
Unstructured, not organized
Data from various sources, can be raw
A great way to store massive amounts of data quickly
e.g., AWS S3, Google Cloud Storage, Git LFS
If your data fits in memory, there’s no advantage to putting it in a database: it will only be slower and more frustrating.
- Hadley Wickham, Chief Scientist at Posit
Tip
Many times we don’t have enough power to handle big data:
too big to fit in memory
You can’t use the same toolbox in R/Python
So many choices to consider
Two approaches:
Big data problems are often described as “small data problems in disguise”, meaning:
Often what we care about is only a subset of the large data.
When data is stored in a well-structured way (e.g., in a database),
we can avoid loading the whole dataset into memory
and read only what is relevant.
Database querying: Retrieve only what you need
Chunk processing
Downsampling
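Chunk processing and downsampling can be sketched with the Python standard library alone. Everything here is an illustrative assumption: the helper names, the column names, and the in-memory CSV text that stands in for a file too large to load at once. Reservoir sampling is one common way to downsample a stream of unknown length; other sampling schemes would also work.

```python
import csv
import io
import random
from itertools import islice


def read_in_chunks(file_obj, chunk_size):
    """Yield lists of at most chunk_size rows, so only one chunk is in memory."""
    reader = csv.DictReader(file_obj)
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk


def reservoir_sample(rows, k, seed=0):
    """Keep a uniform random sample of k rows from a stream of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            sample.append(row)       # fill the reservoir first
        else:
            j = rng.randint(0, i)    # inclusive on both ends
            if j < k:
                sample[j] = row      # replace with probability k / (i + 1)
    return sample


# Hypothetical stand-in for a large CSV file on disk (1,000 made-up rows).
csv_text = "trade_id,price\n" + "\n".join(
    f"{i},{100 + i % 5}" for i in range(1000)
)

# Chunk processing: compute a mean while holding 100 rows at a time.
total, count = 0.0, 0
for chunk in read_in_chunks(io.StringIO(csv_text), chunk_size=100):
    total += sum(float(row["price"]) for row in chunk)
    count += len(chunk)
mean_price = total / count

# Downsampling: keep 50 rows that fit comfortably in memory.
sample = reservoir_sample(csv.DictReader(io.StringIO(csv_text)), k=50)
print(count, mean_price, len(sample))
```

In practice, data-frame libraries offer their own chunked readers; the standard-library version is used here only to keep the sketch self-contained.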
Cloud computing (Virtual Machines / Containers)
High performance clusters (HPC)
Well-known cloud computing providers:
Managed by Research Computing department
Two main clusters: CIRCE and Secure Cluster for sensitive data
Access through JIRA or email request
A database is a collection of data that is structured and organized.
A filing cabinet arranges items by alphabetical order:
Files starting “ABC” in the top drawer, “DEF” in the second drawer, etc.
To find Alice’s file, you’d only have to search the top drawer.
For Fred, the second drawer, and so on.
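The filing cabinet analogy can be sketched in a few lines. The names, the three-letters-per-drawer rule, and the helper functions are illustrative assumptions; real databases use more sophisticated index structures, but the idea of searching one drawer instead of the whole cabinet is the same.

```python
def drawer_for(name: str) -> int:
    """Map a name to a drawer: A-C -> 0 (top), D-F -> 1 (second), and so on."""
    return (ord(name[0].upper()) - ord("A")) // 3


# Hypothetical files to store.
names = ["Alice", "Bob", "Carol", "Dave", "Erin", "Fred", "Grace", "Heidi"]

# File every name into its drawer.
cabinet = {}
for name in names:
    cabinet.setdefault(drawer_for(name), []).append(name)


def find(name: str):
    """Search only the relevant drawer; return (found, files searched)."""
    drawer = cabinet.get(drawer_for(name), [])
    return name in drawer, len(drawer)


found, searched = find("Fred")
print(found, searched, "of", len(names))  # True 3 of 8
```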
Relational database (RDBMS)
Traditional, often called SQL databases
Data stored in tabular form
Best suited to data whose structure does not change often
Used when accuracy and consistency are crucial (e.g., financial data)
e.g., PostgreSQL, MySQL
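A minimal sketch of tabular storage and a SQL query follows. SQLite stands in for PostgreSQL or MySQL only because it ships with Python; the table name, columns, and rows are made up for illustration.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for a relational database.
conn = sqlite3.connect(":memory:")

# Data lives in a table with a fixed set of typed columns.
conn.execute(
    "CREATE TABLE bonds (bond_id TEXT PRIMARY KEY, state TEXT, coupon REAL)"
)
conn.executemany(
    "INSERT INTO bonds VALUES (?, ?, ?)",
    [("B001", "FL", 4.0), ("B002", "NY", 3.5), ("B003", "FL", 5.0)],
)

# Retrieve only the rows and columns that are needed, not the whole table.
fl_bonds = conn.execute(
    "SELECT bond_id, coupon FROM bonds WHERE state = ? ORDER BY bond_id",
    ("FL",),
).fetchall()
conn.close()

print(fl_bonds)  # [('B001', 4.0), ('B003', 5.0)]
```

The `WHERE` clause does the filtering inside the database, so only the matching rows reach Python.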
Non-relational database (NoSQL)
Often called NoSQL databases
Data stored in formatted text form (e.g., JSON)
For organizing complex, diverse, and changing data
e.g., MongoDB
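The document-style storage described above can be illustrated with plain JSON. This sketch uses only Python's `json` module and does not use MongoDB's API; the records and field names are hypothetical.

```python
import json

# Two hypothetical documents: each record carries its own set of fields.
documents = [
    '{"issuer": "City A", "rating": "AA", "tags": ["water", "revenue"]}',
    '{"issuer": "County B", "contact": {"email": "info@example.com"}}',
]

records = [json.loads(doc) for doc in documents]

# Fields may be missing, so reads usually supply a default.
ratings = [rec.get("rating", "unrated") for rec in records]
print(ratings)  # ['AA', 'unrated']
```

Unlike a relational table, there is no fixed column list: the second record has a nested `contact` field that the first lacks, and neither record had to be declared in advance.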
Mergent FISD (Fixed Income Securities Database)
Issue table
Issuer table
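Two related tables such as these are typically combined with a join. The sketch below is an assumption-heavy illustration: only the two table names come from the slide, while every column name, the one-issuer-to-many-issues link, and all rows are hypothetical and are not the actual Mergent FISD schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schemas; the real FISD tables have different columns.
conn.executescript(
    """
    CREATE TABLE issuer (
        issuer_id   INTEGER PRIMARY KEY,
        issuer_name TEXT
    );
    CREATE TABLE issue (
        issue_id  INTEGER PRIMARY KEY,
        issuer_id INTEGER REFERENCES issuer (issuer_id),
        coupon    REAL
    );
    INSERT INTO issuer VALUES (1, 'Example Corp'), (2, 'Sample Inc');
    INSERT INTO issue VALUES (10, 1, 4.5), (11, 1, 5.0), (12, 2, 3.75);
    """
)

# Join each issue to its issuer, then summarize per issuer.
rows = conn.execute(
    """
    SELECT issuer.issuer_name, COUNT(*), AVG(issue.coupon)
    FROM issue
    JOIN issuer ON issue.issuer_id = issuer.issuer_id
    GROUP BY issuer.issuer_name
    ORDER BY issuer.issuer_name
    """
).fetchall()
conn.close()

print(rows)  # [('Example Corp', 2, 4.75), ('Sample Inc', 1, 3.75)]
```

Keeping issuer details in one table and issue details in another avoids repeating the issuer's information on every bond it has issued.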
Client-server DBMS: runs on a server within an organization
Cloud DBMS: similar to a client-server DBMS, but hosted in the cloud
In-process DBMS: runs entirely on your own computer
FIN4773: Big Data and Machine Learning in Finance